Day 06-Feature Engineering -- 2. Categorical Encoding(5)

2.1 One hot encoding
2.2 Count and Frequency encoding
2.3 Target encoding / Mean encoding
2.4 Ordinal encoding
2.5 Weight of Evidence
2.6 Rare label encoding
2.7 Helmert encoding
2.8 Probability Ratio Encoding
2.9 Label encoding
2.10 Feature hashing
2.11 Binary encoding & BaseN encoding
2.12 Sum Encoder (Deviation Encoding or Effect Encoding)
2.13 Backward Difference
2.14 Polynomial
2.15 Leave One Out
2.16 James-Stein
2.17 M-estimator
2.18 CatBoost encoding

We will use the following data frame, which has two independent variables, or features, and one label (target), for a total of ten records.
| Rec-No | Temperature | Color | Target |
| ------ | ----------- | ----- | ------ |
| 0 | Hot | Red | 1 |
| 1 | Cold | Yellow | 1 |
| 2 | Very Hot | Blue | 1 |
| 3 | Warm | Blue | 0 |
| 4 | Hot | Red | 1 |
| 5 | Warm | Yellow | 0 |
| 6 | Warm | Red | 1 |
| 7 | Hot | Yellow | 0 |
| 8 | Hot | Yellow | 1 |
| 9 | Cold | Yellow | 1 |
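
The code in the sections below assumes this data frame already exists as `df`. A minimal sketch to build it with pandas:

import pandas as pd

# Build the ten-record sample data frame shown above
df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Color': ['Red', 'Yellow', 'Blue', 'Blue', 'Red',
              'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})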

2.12 Sum Encoder (Deviation Encoding or Effect Encoding)

Sum encoding compares the mean of the target for one category of a variable against the overall mean across all categories.

import category_encoders as ce

# Sum (deviation) encode the Temperature column
Sum_encoder = ce.SumEncoder(cols=['Temperature'])
df_se = Sum_encoder.fit_transform(df['Temperature'])
df_se.columns = ['se_' + str(i) for i in df_se.columns]  # prefix the new columns
df = pd.concat([df, df_se], axis=1)
df

| / | Temperature | Color | Target | se_intercept | se_Temperature_0 | se_Temperature_1 | se_Temperature_2 |
| - | ----------- | ----- | ------ | ------------ | ---------------- | ---------------- | ---------------- |
| 0 | Hot | Red | 1 | 1 | 1.0 | 0.0 | 0.0 |
| 1 | Cold | Yellow | 1 | 1 | 0.0 | 1.0 | 0.0 |
| 2 | Very Hot | Blue | 1 | 1 | 0.0 | 0.0 | 1.0 |
| 3 | Warm | Blue | 0 | 1 | -1.0 | -1.0 | -1.0 |
| 4 | Hot | Red | 1 | 1 | 1.0 | 0.0 | 0.0 |
| 5 | Warm | Yellow | 0 | 1 | -1.0 | -1.0 | -1.0 |
| 6 | Warm | Red | 1 | 1 | -1.0 | -1.0 | -1.0 |
| 7 | Hot | Yellow | 0 | 1 | 1.0 | 0.0 | 0.0 |
| 8 | Hot | Yellow | 1 | 1 | 1.0 | 0.0 | 0.0 |
| 9 | Cold | Yellow | 1 | 1 | 0.0 | 1.0 | 0.0 |
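
To see the quantity sum coding is built around, we can compare each temperature level's target mean against the overall mean directly (a quick pandas check; the plain mean of the ten targets, 0.7, serves as the grand mean here):

# Deviation of each category's target mean from the overall mean --
# the comparison that sum (deviation/effect) coding exposes to a
# downstream linear model
grand_mean = df['Target'].mean()                         # 0.7
level_means = df.groupby('Temperature')['Target'].mean()
print(level_means - grand_mean)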

2.13 Backward Difference

In backward difference encoding, the mean of the target for one category of a variable is compared with the mean for the preceding category. This method can be particularly beneficial for nominal or ordinal variables.

# Backward difference encode the Temperature column
ce_backward = ce.BackwardDifferenceEncoder(cols=['Temperature'])
df_ce = ce_backward.fit_transform(df['Temperature'])
df_ce.columns = ['bk_' + str(i) for i in df_ce.columns]  # prefix the new columns
df = pd.concat([df, df_ce], axis=1)
df

| / | Temperature | Color | Target | bk_intercept | bk_Temperature_0 | bk_Temperature_1 | bk_Temperature_2 |
| - | ----------- | ----- | ------ | ------------ | ---------------- | ---------------- | ---------------- |
| 0 | Hot | Red | 1 | 1 | -0.75 | -0.5 | -0.25 |
| 1 | Cold | Yellow | 1 | 1 | 0.25 | -0.5 | -0.25 |
| 2 | Very Hot | Blue | 1 | 1 | 0.25 | 0.5 | -0.25 |
| 3 | Warm | Blue | 0 | 1 | 0.25 | 0.5 | 0.75 |
| 4 | Hot | Red | 1 | 1 | -0.75 | -0.5 | -0.25 |
| 5 | Warm | Yellow | 0 | 1 | 0.25 | 0.5 | 0.75 |
| 6 | Warm | Red | 1 | 1 | 0.25 | 0.5 | 0.75 |
| 7 | Hot | Yellow | 0 | 1 | -0.75 | -0.5 | -0.25 |
| 8 | Hot | Yellow | 1 | 1 | -0.75 | -0.5 | -0.25 |
| 9 | Cold | Yellow | 1 | 1 | 0.25 | -0.5 | -0.25 |
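
The bk_* columns are built around the differences between the target means of successive categories. A small sketch of those differences (the level ordering is an assumption here, taken as order of first appearance in the data):

# Differences between target means of adjacent Temperature levels --
# the comparisons backward difference coding is designed to estimate.
# The ordering below is an assumption (order of first appearance).
order = ['Hot', 'Cold', 'Very Hot', 'Warm']
level_means = df.groupby('Temperature')['Target'].mean().reindex(order)
print(level_means.diff().dropna())   # mean(level k) - mean(level k-1)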

2.14 Polynomial

Polynomial coding is a less commonly used method, yet it is among the methods that best capture the information in a variable. The goal of polynomial coding is to identify linear and non-linear trends in the relationship between the dependent and independent variables, looking for linear, quadratic, and cubic trends in the categorical variable.

# Polynomial (orthogonal contrast) encode the Temperature column
ce_poly = ce.PolynomialEncoder(cols=['Temperature'])
dfp = ce_poly.fit_transform(df['Temperature'])
dfp.columns = ['poly_' + str(i) for i in dfp.columns]  # prefix the new columns
df = pd.concat([df, dfp], axis=1)
df

| / | Temperature | Color | Target | poly_intercept | poly_Temperature_0 | poly_Temperature_1 | poly_Temperature_2 |
| - | ----------- | ----- | ------ | -------------- | ------------------ | ------------------ | ------------------ |
| 0 | Hot | Red | 1 | 1 | -0.670820 | 0.5 | -0.223607 |
| 1 | Cold | Yellow | 1 | 1 | -0.223607 | -0.5 | 0.670820 |
| 2 | Very Hot | Blue | 1 | 1 | 0.223607 | -0.5 | -0.670820 |
| 3 | Warm | Blue | 0 | 1 | 0.670820 | 0.5 | 0.223607 |
| 4 | Hot | Red | 1 | 1 | -0.670820 | 0.5 | -0.223607 |
| 5 | Warm | Yellow | 0 | 1 | 0.670820 | 0.5 | 0.223607 |
| 6 | Warm | Red | 1 | 1 | 0.670820 | 0.5 | 0.223607 |
| 7 | Hot | Yellow | 0 | 1 | -0.670820 | 0.5 | -0.223607 |
| 8 | Hot | Yellow | 1 | 1 | -0.670820 | 0.5 | -0.223607 |
| 9 | Cold | Yellow | 1 | 1 | -0.223607 | -0.5 | 0.670820 |
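
The poly_* values are orthonormal polynomial contrasts for four levels. A minimal numpy sketch reproduces them (up to sign) by QR-decomposing a Vandermonde matrix:

import numpy as np

# Orthonormal polynomial contrasts for 4 levels: QR-decompose the
# Vandermonde matrix with columns 1, x, x^2, x^3, then drop the
# constant column; what remains are the linear, quadratic and cubic
# contrasts (matching the poly_Temperature_* values up to sign)
levels = np.arange(4)
V = np.vander(levels, 4, increasing=True)
Q, _ = np.linalg.qr(V)
print(Q[:, 1:])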

2.15 Leave One Out

Similar to target encoding, but when computing the mean of the target for each category, we exclude the target of the current row. This reduces the effect of outliers.

# Leave-one-out encoding needs the target, so split features and label
X = df.drop(['Target'], axis=1)
y = df['Target']
ce_leave = ce.LeaveOneOutEncoder(cols=['Temperature'])
dfl = ce_leave.fit_transform(X, y)
dfl
| / | Temperature | Color |
| - | ----------- | ----- |
| 0 | 0.666667 | Red |
| 1 | 1.000000 | Yellow |
| 2 | 0.700000 | Blue |
| 3 | 0.500000 | Blue |
| 4 | 0.666667 | Red |
| 5 | 0.500000 | Yellow |
| 6 | 0.000000 | Red |
| 7 | 1.000000 | Yellow |
| 8 | 0.666667 | Yellow |
| 9 | 1.000000 | Yellow |
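
A manual check of the first row: row 0 is Hot with Target 1, and the other Hot rows (4, 7, 8) have targets 1, 0, 1, whose mean is the 0.666667 shown above:

# Leave-one-out value for row 0: mean of Target over the other 'Hot' rows
other_hot = df.loc[(df['Temperature'] == 'Hot') & (df.index != 0), 'Target']
print(other_hot.mean())   # 0.666667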

2.16 James-Stein Encoding

James-Stein encoding is a target-based encoder. It is similar to target encoding, but the values it produces are shrunk toward the overall (population) mean of the target, so each encoded value is a weighted sum of the individual category's target mean and the overall mean.

The James-Stein estimator has one practical limitation: it was designed for normal distributions, so it is not well suited to classification models. To overcome this, we can convert a binary target to a log-odds ratio, or use a beta distribution instead.

# James-Stein encode Temperature, shrinking toward the overall target mean
ce_James = ce.JamesSteinEncoder(cols=['Temperature'])
dfj = ce_James.fit_transform(X, y)
dfj
| / | Temperature | Color |
| - | ----------- | ----- |
| 0 | 0.741379 | Red |
| 1 | 1.000000 | Yellow |
| 2 | 1.000000 | Blue |
| 3 | 0.405229 | Blue |
| 4 | 0.741379 | Red |
| 5 | 0.405229 | Yellow |
| 6 | 0.405229 | Red |
| 7 | 0.741379 | Yellow |
| 8 | 0.741379 | Yellow |
| 9 | 1.000000 | Yellow |
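
The blend itself is easy to write down. The sketch below shows its general shape with an illustrative shrinkage weight B; it does not reproduce the exact variance-based weight that category_encoders computes:

# James-Stein-style blend: shrink the category mean toward the overall
# mean. B here is illustrative only; the library derives it from the
# variances of the category and overall means.
def js_blend(cat_mean, overall_mean, B):
    return (1 - B) * cat_mean + B * overall_mean

print(js_blend(0.75, 0.7, 0.2))   # the 'Hot' mean with an assumed B = 0.2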

2.17 M-estimator

The M-estimate encoder is a simpler version of the target encoder, similar to the James-Stein encoder, but it adjusts each category's target mean toward the overall mean using an additional parameter m, whose default value is 1.

# M-estimate encode Temperature (additive smoothing with m = 1 by default)
ce_M_estimator = ce.MEstimateEncoder(cols=['Temperature'])
dfM = ce_M_estimator.fit_transform(X, y)
dfM
| / | Temperature | Color |
| - | ----------- | ----- |
| 0 | 0.740 | Red |
| 1 | 0.900 | Yellow |
| 2 | 0.850 | Blue |
| 3 | 0.425 | Blue |
| 4 | 0.740 | Red |
| 5 | 0.425 | Yellow |
| 6 | 0.425 | Red |
| 7 | 0.740 | Yellow |
| 8 | 0.740 | Yellow |
| 9 | 0.900 | Yellow |
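
With m = 1, these values follow directly from the smoothing formula (n_i * mean_i + m * overall_mean) / (n_i + m). A quick check reproduces the 0.740 for Hot:

# M-estimate for 'Hot': (sum of targets + m * overall mean) / (count + m)
m = 1
overall_mean = df['Target'].mean()            # 0.7
hot = df.loc[df['Temperature'] == 'Hot', 'Target']
print((hot.sum() + m * overall_mean) / (hot.count() + m))   # 0.74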
